ABSTRACT

A brief discussion of the literature concerned with the two-population discrimination problem is presented, and several procedures based on the likelihood ratio for discrimination between negative exponentially distributed populations are proposed. The small-sample and asymptotic performance of these procedures is compared with that of non-parametric procedures and the classical linear discriminant function. Some guidelines for the use of the procedures discussed are presented.
I.   INTRODUCTION

The problem of classification arises when one or more measurements are made on an individual and one wishes to classify the individual as belonging to one of a finite number of categories on the basis of these measurements. Each category is characterized by a probability distribution of the measurements, but the proper category of the individual is not observable; it must be inferred from the measurements. Thus the problem, in abstract terms, is: given an observation of a random variable arising from one of several populations, find a rule for deciding from which population the observation came.

The classification problem is, then, one of finding an appropriate "statistical decision function." We have a number of hypotheses: each hypothesis is that the distribution of the observation is that corresponding to a given population, and one of these hypotheses must be selected, the others rejected.

In the classification problem, there are essentially three levels of information about the distributions corresponding to the various populations which may be available to the statistician.

1.  the distributions may be completely known

2.  the distributions may be known to belong to a given family indexed by a parameter which is unknown

3.  the distributions may be completely unknown

In cases 2) and 3), information about the value of the parameter or about the unknown distribution is usually available from a sample or sequence of realizations of the random variable corresponding to each population.

In the investigations reported in this thesis, the individual to be classified belongs to one of two populations. In this situation, case 1) above is equivalent to the simple vs. simple hypothesis testing problem whose solution is given by the Neyman-Pearson Lemma. Case 2) has received relatively little attention except under the assumption that the family of distributions is multivariate normal with the same (but unknown) covariance matrix. The distributions of the statistics arising in this situation have been derived. In addition, Hoel and Peterson (5) have derived very general conditions under which procedures using sample estimates of the parameters are asymptotically optimal. Case 3) was first considered by Fix and Hodges in 1951.

In Section II of this thesis the non-parametric procedure proposed by Fix and Hodges (2,3) and the application of this procedure when the distribution of the random variables is negative exponential will be reviewed. A bound on the error probabilities of the Fix-Hodges procedure discovered by Cover and Hart (1) and a more general procedure proposed by Loftsgaarden and Quesenbury (6) will also be examined.

Section III will present the results of a study of a Likelihood Ratio discrimination procedure in case 2) above and a comparison of the performance of the various procedures considered in this thesis when the random variables have the univariate negative exponential distribution. In Section IV conclusions and recommendations arising from this study will be presented.
II.   REVIEW OF LITERATURE

Notation and Definitions

In considering the classification problem, the following structure will be assumed. The two categories or populations have distribution functions $F$ and $G$, and without loss of generality, since the measures with cumulative distribution functions $F$ and $G$ are absolutely continuous with respect to that given by $F + G$, the density functions $f$ and $g$ will be supposed to exist. Random samples from the two distributions are available: $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$, independent and identically distributed as $F$ and as $G$ respectively; they may be used to obtain information about the respective distributions. An observation $z$ of the random variable $Z$ is made, and the classification problem is to decide whether $Z$ is distributed as $F$ or as $G$. The abbreviation $Z \sim F$ should be read "Z is distributed as F." The probabilities of misclassification will be designated as

$$P_1 = \Pr\{\text{assign } Z \sim G \mid Z \sim F\}$$

$$P_2 = \Pr\{\text{assign } Z \sim F \mid Z \sim G\}$$

In the case that the distributions are negative exponential,

$$F(x) = 1 - e^{-\lambda x} \quad \text{and} \quad G(y) = 1 - e^{-\mu y}.$$

Throughout this thesis reference will be made to discrimination procedures which tend to behave similarly in the limit; that is, as the number of sample observations upon which they are based grows very large. This concept may be made explicit by introducing two notions of consistency defined by Fix and Hodges (2):
Definition 1:

The sequences of decision functions $\{\Delta'_{m,n}\}$ and $\{\Delta''_{m,n}\}$ are said to be consistent in the sense of performance characteristics if, whatever be the true distributions of the random variables, for any $\varepsilon > 0$ there exists $N$ so that if $m \ge N$ and $n \ge N$

$$\left| \Pr\{\Delta'_{m,n} = \delta_i\} - \Pr\{\Delta''_{m,n} = \delta_i\} \right| < \varepsilon$$

for every possible decision $\delta_i$.

Definition 2:

The sequences of decision functions $\{\Delta'_{m,n}\}$ and $\{\Delta''_{m,n}\}$ are said to be consistent in the sense of decision functions if, whatever be the true distributions of the random variables, for any $\varepsilon > 0$ there exists $N$ so that if $m \ge N$ and $n \ge N$

$$\Pr\{\Delta'_{m,n} = \Delta''_{m,n}\} > 1 - \varepsilon.$$

It is clear that consistency in the second sense implies that in the first. All proofs of consistency by Fix and Hodges and those in this thesis provide consistency in the stronger sense. The modifying phrase will, however, be omitted.

Discrimination when the distributions are completely known

When the two distributions $F$ and $G$ are completely known, the problem of assigning an observation $z$ to one of the two may be posed as a test of the hypothesis $Z \sim F$ against the alternative $Z \sim G$. In this case, the Neyman-Pearson Lemma gives the procedure: Assign $Z$ as distributed according to $F$ if

$$\frac{f(z)}{g(z)} > t, \qquad \text{where } t \text{ is to be determined, } 0 \le t \le \infty.$$

Assign $Z \sim F$ with probability $\gamma$ if

$$\frac{f(z)}{g(z)} = t.$$

Otherwise assign $Z \sim G$. This procedure is optimal in that for any assigned probability of error "of the first kind," i.e., $\Pr\{\text{assign } Z \sim G \mid Z \sim F\} = P_1$, the probability of error "of the second kind," i.e., $\Pr\{\text{assign } Z \sim F \mid Z \sim G\} = P_2$, of this procedure is no greater than that of any other. The value of $t$ is chosen in the classical hypothesis-testing problem so that the probability of error of the first kind is some chosen value. Since the class of Neyman-Pearson tests is equivalent to the class of Bayes tests, the above procedure (for the appropriate choice of $t$) is also optimal with respect to minimizing any given weighted sum of the two error probabilities.

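This procedure will be designated $L(t)$. In the case that $F$ and $G$ are negative exponential distributions with parameters $\lambda$ and $\mu$, so that $f(z) = \lambda e^{-\lambda z}$ and $g(z) = \mu e^{-\mu z}$, the $L(t)$ procedure is:

Assign $Z \sim F$ if and only if $\dfrac{\lambda}{\mu}\, e^{(\mu - \lambda) z} > t$.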
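As a minimal sketch of the $L(t)$ rule for two negative exponential populations with known parameters, the likelihood ratio can be evaluated directly. The function name `assign_to_F` and the argument names `lam` and `mu` (standing for the parameters of F and G) are illustrative choices, not notation from the thesis.

```python
import math

def assign_to_F(z, lam, mu, t=1.0):
    """Return True when L(t) assigns z to F (rate lam) rather than G (rate mu)."""
    # Likelihood ratio f(z)/g(z) for negative exponential densities.
    ratio = (lam / mu) * math.exp((mu - lam) * z)
    return ratio > t

# With lam = 2, mu = 1 and t = 1 the rule reduces to z < ln 2.
print(assign_to_F(0.5, 2.0, 1.0))  # True
print(assign_to_F(1.0, 2.0, 1.0))  # False
```

Because the ratio is monotone in z whenever the two parameters differ, the rule is equivalent to comparing z with a single cut-off point.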

Discrimination when the distributions are completely unknown

When nothing can be assumed about the form of the distributions corresponding to the two populations, the statistician has only the observations $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from which to obtain information enabling him to classify $Z$ appropriately. The procedures which Fix and Hodges (2) suggest involve the estimation of the densities $f$ and $g$ at the point of interest, and the use of these estimates in the likelihood ratio procedure. The following theorem due to Fix and Hodges demonstrates the asymptotic optimality of this procedure.

Theorem 1: Let $\hat f$ and $\hat g$ denote estimates of the densities $f$ and $g$ respectively and let $L^*(t;\hat f,\hat g)$ denote the likelihood ratio discrimination procedure using $\hat f$ and $\hat g$ in place of $f$ and $g$. If $\hat f_{m,n}(z)$ and $\hat g_{m,n}(z)$ are consistent estimates for $f(z)$ and $g(z)$ for all $z$ except possibly for $z \in N_{f,g}$, where $P_F(N_{f,g}) = 0 = P_G(N_{f,g})$, then $L^*_{m,n}(t;\hat f,\hat g)$ is consistent with $L(t)$.

The problem, then, is reduced to that of finding consistent estimates of the densities $f$ and $g$. If the observation space is reduced to one dimension by a non-negative transformation $\rho$, such that $x_n \to x$ entails $\rho(x_n, x) \to 0$, and if, further, for each $z$ except possibly for a null set under both the $F$ and $G$ distributions, $\rho(X,z)$ and $\rho(Y,z)$ are random variables with continuous densities not both zero at zero, then given the observation $z$ to be classified, the observations $X_1, \ldots, X_m;\ Y_1, \ldots, Y_n$ may be replaced by $\rho(X_1,z), \ldots, \rho(X_m,z);\ \rho(Y_1,z), \ldots, \rho(Y_n,z)$ and the discrimination involves non-negative univariate random variables. A consistent estimate of the transformed densities is given by the following theorem of Fix and Hodges.

Theorem 2: Let $X$ and $Y$ be non-negative. Let $f$ and $g$ be positive and continuous at 0. Let $k(m,n)$ be a positive, integer-valued function such that $k(m,n) \to \infty$, $\frac{1}{m} k(m,n) \to 0$ and $\frac{1}{n} k(m,n) \to 0$ as $m,n \to \infty$ with $\frac{m}{n} \to \delta \ne 0$ or $\infty$. Define

$U$ = $k$-th smallest value of the combined samples of X's and Y's

$M$ = number of X's $\le U$

$N$ = number of Y's $\le U$

then $\dfrac{M}{mU}$ is a consistent estimate for $f(0)$ and $\dfrac{N}{nU}$ is a consistent estimate for $g(0)$.

The $L^*(t;\hat f,\hat g)$ procedure thus requires: Assign $Z \sim F$ if and only if

$$\frac{\hat f}{\hat g} = \frac{M/m}{N/n} \ge t.$$
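The counting rule of Theorem 2 can be sketched for univariate data as follows. This is an illustration under stated assumptions rather than Fix and Hodges' own formulation: the transformation is taken to be the absolute distance |x - z|, and the function name `fix_hodges_assign_to_F` is hypothetical.

```python
def fix_hodges_assign_to_F(z, xs, ys, k, t=1.0):
    """Sketch of L*(t; f_hat, g_hat) using the k nearest sample points to z."""
    m, n = len(xs), len(ys)
    # Transform both samples to non-negative distances from z (assumed rho).
    dx = [abs(x - z) for x in xs]
    dy = [abs(y - z) for y in ys]
    # U is the k-th smallest distance in the combined sample.
    u = sorted(dx + dy)[k - 1]
    big_m = sum(d <= u for d in dx)  # number of X's within U of z
    big_n = sum(d <= u for d in dy)  # number of Y's within U of z
    # Assign to F iff (M/m)/(N/n) >= t, cross-multiplied so N = 0 is safe.
    return big_m * n >= t * big_n * m

xs = [0.1, 0.2, 0.9]
ys = [1.0, 1.5, 2.0]
print(fix_hodges_assign_to_F(0.3, xs, ys, k=3))  # True
print(fix_hodges_assign_to_F(1.4, xs, ys, k=3))  # False
```

The common factor U cancels in the ratio of the two density estimates, which is why only the counts M and N and the sample sizes appear in the comparison.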

Performance of the Non-Parametric Discriminator with finite samples

Fix and Hodges (3) continued the investigation of their non-parametric discrimination procedure by examining its performance for small samples where the distributions are Normal with identical covariance matrix; that is, under conditions in which the linear discriminant function is known to be an optimal procedure. The bulk of that investigation is for univariate distributions with $k$ (the total number of the available samples used in the classification) equal to one. This is the "Rule of Nearest Neighbor": classify $Z \sim F$ if and only if z's nearest neighbor is an x. Fix and Hodges obtain the misclassification probability for this procedure for a considerable range of sample sizes and for distances between population means of 1, 2 and 3 times the standard deviation. Limiting error probabilities (as $m = n \to \infty$) are obtained for $k = 1$ and $k = 3$ with distances between population means of 1 to 5 times the standard deviation. Some results are obtained for bivariate normal distributions and an estimate of the performance of the discriminator for $k > 3$ is obtained. One very interesting result of this investigation is that, regardless of the underlying distributions, as $m = n \to \infty$ the two error probabilities of the rule of nearest neighbor are equal and no greater than one-half.

Hager (4) investigated the performance of the "rule of nearest neighbor" under the assumption that $F$ and $G$ were negative exponential. He contrasted this with the performance, under the same conditions, of the linear discriminant function and obtained misclassification probabilities for a wide range of (equal) sample sizes and parameter values for the latter procedure when $F$ and $G$ were Gamma distributions of order 1 to 20. His results in the exponential case are included in Section III of this thesis.

Loftsgaarden and Quesenbury (6) proposed an alternative density estimator to that suggested by Fix and Hodges, which is consistent and applicable in a Euclidean space of any dimension. The procedure is as follows. Let $j(m)$ be a sequence of integers such that

$$\lim_{m \to \infty} j(m) = \infty$$

$$\lim_{m \to \infty} \frac{j(m)}{m} = 0$$

To estimate the density at a point $z$, using a sample $x_1, \ldots, x_m$, let $w_{(1)}, \ldots, w_{(m)}$ be the transformed sample $|x_1 - z|, \ldots, |x_m - z|$ ordered from smallest to largest. Let $A_{w_{(j)},z}$ denote the volume (Lebesgue measure) of the hypersphere of radius $w_{(j)}$ centered at $z$; then

$$\tilde f(z) = \frac{j-1}{m} \cdot \frac{1}{A_{w_{(j)},z}}$$

is a consistent estimate of the density $f$ at the point $z$.

If the density $g$ at $z$ is similarly estimated based on $y_1, \ldots, y_n$, denoting the transformed sample by $v_{(1)}, \ldots, v_{(n)}$ (where $\ell(n)$ is a sequence with the same characteristics as $j$ above), then by Theorem 1 the procedure $L^*(t;\tilde f,\tilde g)$, which requires: assign $Z \sim F$ if and only if

$$\frac{\dfrac{j-1}{m} \cdot \dfrac{1}{A_{w_{(j)},z}}}{\dfrac{\ell-1}{n} \cdot \dfrac{1}{A_{v_{(\ell)},z}}} > t,$$

is consistent with the procedure $L(t)$ and hence asymptotically optimal. Note that, if $t = 1$ and $m = n$, $j = \ell$, this procedure is identical with the Fix-Hodges procedure with $k = j + \ell - 1$, since a majority of the $k$ nearest neighbors of $z$ are x's if and only if $w_{(j)} < v_{(\ell)}$. In the general case, the procedures $L^*(t;\hat f,\hat g)$ and $L^*(t;\tilde f,\tilde g)$ are quite similar but not identical. The density estimate $\tilde f$ has applicability to problems other than that of classification, while the estimate $\hat f$ is not so versatile.

In their paper, Loftsgaarden and Quesenbury report a small empirical study of the density estimator $\tilde f$ when the true distributions are Uniform, negative exponential, and Normal. Based on this study, they recommend that the sequence $j(n)$ take values not less than $n^{1/2}$.
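For univariate data the hypersphere of radius r centered at z is the interval [z - r, z + r], of Lebesgue measure 2r, so the estimator can be sketched in a few lines. The function name `lq_density` and the default choice of j (the nearest integer to the square root of the sample size, but at least 2) are assumptions made for illustration.

```python
import math

def lq_density(z, sample, j=None):
    """Loftsgaarden-Quesenbury style estimate of a univariate density at z."""
    m = len(sample)
    if j is None:
        # Assumed default: j(m) near sqrt(m), at least 2 so the estimate is nonzero.
        j = max(2, round(math.sqrt(m)))
    w = sorted(abs(x - z) for x in sample)
    radius = w[j - 1]       # distance from z to its j-th nearest sample point
    volume = 2.0 * radius   # Lebesgue measure of [z - radius, z + radius]
    return (j - 1) / (m * volume)

# An evenly spaced "sample" on [0, 1) has density 1; the estimate at 0.5 is 0.9.
grid = [i / 100 for i in range(100)]
print(round(lq_density(0.5, grid), 3))  # 0.9
```

In higher dimensions only the volume formula changes; the ordering of distances and the factor (j - 1)/m stay the same.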


In an article published in 1967, Cover and Hart (1) evaluated the rule of nearest neighbor in a slightly different context from that in which the previous investigations had placed it. Their work is in a Bayesian context so that there is a probability structure over the space $\{F, G\}$:

$$\eta_1 = \Pr\{Z \sim F\}$$

$$\eta_2 = \Pr\{Z \sim G\}$$

It is assumed also that the random sample of X's and Y's arises in such a way that there is one fixed sample size, with the number of X's within that sample being probabilistically determined.

Suppose the classification loss function simply counts wrong decisions, i.e., the loss is 0 or 1 depending on whether the observation to be classified is assigned correctly or incorrectly. If $R^*$ designates the expected risk of the Bayes procedure with respect to a given prior distribution $(\eta, 1-\eta)$, where $\eta = \Pr\{Z \sim F\}$, and if $R$ designates the expected risk (with respect to the same prior distribution) of the rule of nearest neighbor, then the result for discrimination between two populations proved by Cover and Hart is given by the following:

Theorem 3: Let the space of possible values of the random variables be a separable metric space. Let $f$ and $g$ be such that, with probability one, $x$ is either 1) a continuity point of $f$ and $g$, or 2) a point of non-zero probability measure. Then the expected risk $R$ of the nearest neighbor procedure has the bounds

$$R^* \le R \le 2R^*(1 - R^*).$$

These bounds are as tight as possible.

A comparable bound is obtained for the case of discrimination among several populations.
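The bounds of Theorem 3 can be checked numerically for two negative exponential populations with prior (.5, .5). The sketch below assumes the large-sample nearest neighbor risk is the expectation of 2 eta(z)(1 - eta(z)), where eta(z) is the posterior probability of F given z, which is the limiting expression used in Cover and Hart's analysis. The function name `risks`, the midpoint integration rule, and the truncation point `upper` are illustrative choices.

```python
import math

def risks(lam, mu, upper=40.0, steps=40_000):
    """Bayes risk R* and limiting nearest-neighbor risk R for prior (.5, .5)."""
    h = upper / steps
    bayes = nn = 0.0
    for i in range(steps):
        z = (i + 0.5) * h                 # midpoint rule on [0, upper]
        f = lam * math.exp(-lam * z)
        g = mu * math.exp(-mu * z)
        bayes += 0.5 * min(f, g) * h      # min(eta, 1 - eta) times mixture density
        nn += f * g / (f + g) * h         # 2 * eta * (1 - eta) times mixture density
    return bayes, nn

r_star, r_nn = risks(1.5, 1.0)
print(round(r_star, 3), round(r_nn, 3))   # 0.426 0.481
print(r_star <= r_nn <= 2 * r_star * (1 - r_star))  # True
```

For a parameter ratio of 1.5 the upper bound 2R*(1 - R*) is about .489, so the limiting nearest neighbor risk of about .481 lies close to it.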




III.   A LIKELIHOOD RATIO DISCRIMINANT

As was noted in the last section, when the probability structure of the two populations to be discriminated is known completely, the likelihood ratio criterion gives the solution to the classification problem: that is, classify $z$ as distributed according to $F$ if

$$\frac{f(z)}{g(z)} \ge t \quad \text{for some } t,\ 0 \le t \le \infty.$$

The procedure which Fix and Hodges selected with which to compare the rule of nearest neighbor was the linear discriminant function, since that procedure is known to be optimal under the assumption that the populations under consideration are Normally distributed with the same covariance matrix. Investigation of the linear discriminant reveals that it is the likelihood ratio procedure using the estimates of the population means and the common covariance matrix as though they were known to be correct. Hager's investigation indicated that the use of the linear discriminant when the populations have the negative exponential distribution can give very poor results and that, in general, the probability of misclassification is divided very unevenly between $P_1$ and $P_2$. It is not surprising that the linear discriminant performs poorly on distributions so radically different from the Normal as the negative exponential. In fact, good performance in this case would be quite surprising.

In attempting to discover a parametric discrimination procedure with good properties, one might emulate the development which leads to the discriminant function and suggest that the random samples from the two populations be used to estimate the parameters of the distributions. The likelihood ratio procedure could then be carried out as though the estimates were known to be correct. This procedure, which will be designated $L(t;\hat\lambda,\hat\mu)$, would then be:

Let

$$\hat\lambda = \frac{m}{\sum_{i=1}^{m} X_i}, \qquad \hat\mu = \frac{n}{\sum_{i=1}^{n} Y_i}.$$

Assign $Z \sim F$ if

$$\frac{\hat\lambda}{\hat\mu}\, e^{(\hat\mu - \hat\lambda) z} \ge t \quad \text{for some } t,\ 0 \le t \le \infty.$$

One may easily verify that this procedure is, indeed, asymptotically optimal. Since $\hat\lambda \xrightarrow{P} \lambda$ and $\hat\mu \xrightarrow{P} \mu$ as $n,m \to \infty$, this result follows from Theorem 4 below, or from a more general theorem of Hoel and Peterson (5).

Theorem 4 (Fix and Hodges): If

a) the estimates $\{\hat\theta_{m,n}\}$ are consistent and

b) for every $\theta$, $f_\theta(z)$ and $g_\theta(z)$ are continuous functions of $\theta$ for every $z$ except perhaps for $z \in N_\theta$, where $\Pr(N_\theta) = 0$ under the distribution given by $f_\theta$ and that given by $g_\theta$,

then the sequence of discrimination procedures obtained by applying the likelihood ratio principle with critical value $t > 0$ to $f_{\hat\theta_{m,n}}(z)$ and $g_{\hat\theta_{m,n}}(z)$ is consistent with $L(t)$.

It is noteworthy that the foregoing procedure (and the linear discriminant function as well) makes no use of the observation $z$ in determining the estimates of the parameters. One might suppose that the use of $z$ for this purpose would improve the performance of the procedure, at least for small sample sizes. Accordingly one could pose the problem as one of testing the composite hypothesis $H_0: Z \sim F$ against the alternative $H_1: Z \sim G$, using the maximum likelihood estimates of $\lambda$ and $\mu$ in both cases so that

$$\tilde\lambda = \frac{m+1}{\sum_{i=1}^{m} X_i + z}, \qquad \tilde\mu = \frac{n+1}{\sum_{i=1}^{n} Y_i + z}.$$

Accept $H_0$ if

$$\frac{\tilde\lambda}{\tilde\mu}\, e^{(\tilde\mu - \tilde\lambda) z} \ge t.$$

This procedure, which will be called $L(t;\tilde\lambda,\tilde\mu)$, is, of course, asymptotically equivalent to $L(t;\hat\lambda,\hat\mu)$, so that it too is consistent with $L(t)$ and hence optimal in the limit.
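The two plug-in procedures differ only in whether z is pooled into the parameter estimates, which the following sketch makes explicit. The function names `lr_statistic`, `assign_hat` and `assign_tilde` are illustrative and do not come from the thesis.

```python
import math

def lr_statistic(lam_est, mu_est, z):
    """Plug-in likelihood ratio (lam/mu) * exp((mu - lam) * z)."""
    return (lam_est / mu_est) * math.exp((mu_est - lam_est) * z)

def assign_hat(z, xs, ys, t=1.0):
    """Estimates use the two samples only (the 'hat' procedure)."""
    lam_hat = len(xs) / sum(xs)
    mu_hat = len(ys) / sum(ys)
    return lr_statistic(lam_hat, mu_hat, z) >= t

def assign_tilde(z, xs, ys, t=1.0):
    """The observation z is pooled into both estimates (the 'tilde' procedure)."""
    lam_til = (len(xs) + 1) / (sum(xs) + z)
    mu_til = (len(ys) + 1) / (sum(ys) + z)
    return lr_statistic(lam_til, mu_til, z) >= t

xs = [0.5, 0.5]   # sample attributed to F, giving lam_hat = 2
ys = [1.0, 1.0]   # sample attributed to G, giving mu_hat = 1
print(assign_hat(0.5, xs, ys), assign_hat(1.0, xs, ys))      # True False
print(assign_tilde(0.5, xs, ys), assign_tilde(2.0, xs, ys))  # True False
```

As the sample sizes grow, the single observation z contributes a vanishing share of each sum, which is the informal reason the two procedures agree in the limit.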

In  the  discussion  up  to  this  point,  the  problem  of  the 
choice  of  t  in  the  two  families  of  procedures  which  have  been 
proposed  has  not  been  considered.   The  following  lemma  will 
clarify  the  problem. 

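Lemma 1: If $t$ is restricted to be a constant in the procedure $L(t;\hat\lambda,\hat\mu)$ or $L(t;\tilde\lambda,\tilde\mu)$ as $\lambda$, $\mu$ range over the parameter space, then if $t \ne 1$, as $m,n \to \infty$, for any $\varepsilon > 0$ there exists $\delta$ so that if $|1 - \mu/\lambda| < \delta$, $P_1 > 1 - \varepsilon$ or $P_2 > 1 - \varepsilon$.

Proof: The procedure $L(t;\hat\lambda,\hat\mu)$ requires: assign $Z \sim F$ iff

$$\frac{\hat\lambda}{\hat\mu}\, e^{(\hat\mu - \hat\lambda) z} > t$$

or

$$(\hat\mu - \hat\lambda) z > \ln\frac{\hat\mu}{\hat\lambda} + \ln t.$$

Let $m,n \to \infty$ so that $\hat\lambda \xrightarrow{P} \lambda$, $\hat\mu \xrightarrow{P} \mu$ and suppose (without loss of generality) that $\lambda < \mu$. Then the procedure assigns $Z \sim G$ incorrectly if and only if

$$z < \frac{\ln(\mu/\lambda)}{\mu - \lambda} + \frac{\ln t}{\mu - \lambda}.$$

Now suppose $t > 1$; since

$$P_1 = \Pr\{\text{assign } Z \sim G \mid Z \sim F\} = \Pr\left\{Z < \frac{\ln(\mu/\lambda)}{\mu - \lambda} + \frac{\ln t}{\mu - \lambda}\right\}$$

$$= 1 - \exp\left\{\frac{\ln(\mu/\lambda)}{1 - \mu/\lambda} + \frac{\ln t}{1 - \mu/\lambda}\right\}$$

$$= 1 - \left(\frac{t\mu}{\lambda}\right)^{\frac{1}{1 - \mu/\lambda}}$$

the desired inequality is achieved if

$$\left(\frac{t\mu}{\lambda}\right)^{\frac{1}{1 - \mu/\lambda}} < \varepsilon$$

or, equivalently, since the exponent is negative,

$$\frac{\lambda}{\mu}\, \varepsilon^{\,1 - \mu/\lambda} < t.$$

Since, by assumption, $\mu/\lambda > 1$ and $t > 1$, there exists $\delta > 0$ so that if $\mu/\lambda - 1 < \delta$ the inequality above is satisfied and the desired conclusion follows. If $t < 1$ a similar argument shows that for appropriate values of $\mu/\lambda$, $P_2 > 1 - \varepsilon$. Since $L(t;\tilde\lambda,\tilde\mu)$ is asymptotically equivalent to $L(t;\hat\lambda,\hat\mu)$ the result follows for the former procedure as well.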
It is noteworthy that for $t = 1$ as $m,n \to \infty$

$$\lim_{\mu/\lambda \to 1^+} P_1 = \lim_{\mu/\lambda \to 1^+} \left(1 - \exp\left\{\frac{\ln(\mu/\lambda)}{1 - \mu/\lambda}\right\}\right) = 1 - e^{-1}$$

and similarly

$$\lim_{\mu/\lambda \to 1^+} P_2 = e^{-1}.$$

(The subscripts on the P's are reversed as $\mu/\lambda \to 1^-$.)

In fact, it is easily verified that, for $t = 1$ as $m,n \to \infty$, if $\frac{1}{2} < \mu/\lambda < 2$ either $P_1$ or $P_2$ is greater than one-half. This is not a desirable situation; however, it is better than the situation which obtains in the use of the linear discriminant function where, as Hager discovered, for $.3863 = [2(\ln 2) - 1] < \mu/\lambda < 1/[2(\ln 2) - 1] = 2.589$, $P_1 > \frac{1}{2}$ or $P_2 > \frac{1}{2}$. Recall that the rule of nearest neighbor has both error probabilities bounded above by $\frac{1}{2}$ as $m,n \to \infty$ irrespective of the population distributions.
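These limiting error probabilities are simple to tabulate. The sketch below evaluates them for t = 1 and known parameters with r = mu/lambda > 1; the function name `limiting_errors` is illustrative, and the expression for P2 is obtained in the same way as that for P1, as the probability under G that Z exceeds the cut-off ln(r)/(mu - lambda).

```python
import math

def limiting_errors(r):
    """Limiting (P1, P2) of the t = 1 rule for r = mu/lambda > 1."""
    p1 = 1.0 - math.exp(math.log(r) / (1.0 - r))   # Pr{assign G | F}
    p2 = math.exp(r * math.log(r) / (1.0 - r))     # Pr{assign F | G}
    return p1, p2

for r in (1.0001, 1.5, 2.0, 3.0):
    p1, p2 = limiting_errors(r)
    print(r, round(p1, 4), round(p2, 4))
# Near r = 1 the pair approaches (1 - 1/e, 1/e); P1 falls to exactly 1/2 at r = 2.
```

The case r < 1 follows by interchanging the roles of the two populations, so P2 then exceeds one-half for 1/2 < r < 1.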

The above results are asymptotic and imply little about the performance of the procedures for small samples. They do, however, sharpen the problem which must be faced in using the $L(t;\hat\lambda,\hat\mu)$ procedure. Either $t$ is fixed at 1 (for if $t \ne 1$ the procedure may become arbitrarily bad as $m,n \to \infty$) or $t$ is made a function of the observations. If the latter course is elected, one might be interested in preventing the possibility of misclassifying an observation with higher probability than one-half. A plausible way to pursue this goal would be to seek a minimax procedure; i.e., one which would make $P_1$ equal to $P_2$. To do this one would, given the estimates $\hat\lambda, \hat\mu$, seek $t = t(\hat\lambda, \hat\mu)$ so that

$$P_F\left\{Z : \frac{\hat\lambda}{\hat\mu}\, e^{(\hat\mu - \hat\lambda) Z} < t\right\} = P_G\left\{Z : \frac{\hat\lambda}{\hat\mu}\, e^{(\hat\mu - \hat\lambda) Z} > t\right\}$$

and use this value of $t$ for the discrimination. The performance of this "minimax" procedure is reported in this thesis.
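One way such a value of t might be found is sketched below. It assumes, as the thesis does not spell out, that both probabilities are evaluated under negative exponential distributions with the estimated parameters substituted for the true ones, and it locates the equalizing cut-off point by bisection. The function name `minimax_threshold` and the iteration counts are illustrative.

```python
import math

def minimax_threshold(lam, mu, iterations=100):
    """Return (t, z0, common_error) equalizing the two plug-in error probabilities."""
    a, b = min(lam, mu), max(lam, mu)   # requires lam != mu

    def h(z):
        # Error for the smaller-rate population minus error for the larger-rate one.
        return 1.0 - math.exp(-a * z) - math.exp(-b * z)

    lo, hi = 0.0, 1.0
    while h(hi) < 0.0:                  # h increases from -1 toward 1; bracket the root
        hi *= 2.0
    for _ in range(iterations):         # plain bisection
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    z0 = 0.5 * (lo + hi)
    t = (lam / mu) * math.exp((mu - lam) * z0)
    return t, z0, math.exp(-b * z0)

t, z0, err = minimax_threshold(1.5, 1.0)
print(round(err, 3))  # 0.43
```

Because the likelihood ratio is monotone in z, choosing t is equivalent to choosing the cut-off z0, and the statistic evaluated at z0 gives the corresponding t.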

In the foregoing material, the ratio $\lambda/\mu$ has occurred frequently. It would be desirable for a discrimination procedure to depend on the parameters of the distributions only through this ratio. Indeed this is true for both $L(t;\hat\lambda,\hat\mu)$ and $L(t;\tilde\lambda,\tilde\mu)$.

Theorem 5: In the procedures $L(t;\hat\lambda,\hat\mu)$ and $L(t;\tilde\lambda,\tilde\mu)$, $P_1 = \Pr\{\text{assign } Z \sim G \mid Z \sim F\}$ depends on $\lambda, \mu$ only through $c = \lambda/\mu$.

A lemma will be established first:

Lemma 2: If $X$ has the negative exponential distribution with parameter $\lambda$, then $X$ is distributed as $(-\ln U)/\lambda$ where $U$ has the Uniform (0,1) distribution.

Proof of lemma:

$$\Pr\{X \le x\} = F(x) = 1 - e^{-\lambda x}$$

$$\Pr\left\{\frac{-\ln U}{\lambda} \le x\right\} = \Pr\{\ln U \ge -\lambda x\} = \Pr\{U \ge e^{-\lambda x}\} = 1 - e^{-\lambda x}$$

The result follows by the Caratheodory extension theorem.
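Lemma 2 is also the basis of a simple sampler for the negative exponential distribution. The following sketch uses Python's standard `random` module; the function name `exponential_variate` is illustrative.

```python
import math
import random

def exponential_variate(lam, rng=random):
    """Draw from the negative exponential distribution with rate lam via -ln(U)/lam."""
    # rng.random() lies in [0, 1), so u lies in (0, 1] and the logarithm is defined.
    u = 1.0 - rng.random()
    return -math.log(u) / lam

rng = random.Random(0)
draws = [exponential_variate(2.0, rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # close to the theoretical mean 1/lam = 0.5
```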




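Proof of Theorem 5

Suppose $n = m = 1$. Then for the procedure $L(t;\hat\lambda,\hat\mu)$, $\hat\lambda = 1/X$ and $\hat\mu = 1/Y$, so that

$$\Pr\{\text{assign } Z \sim G \mid Z \sim F\} = \Pr\left\{\frac{Y}{X} \exp\left[\left(\frac{1}{Y} - \frac{1}{X}\right) Z\right] < t \,\Big|\, Z \sim F\right\}$$

$$= \Pr\left\{\frac{\lambda \ln V}{\mu \ln U} \exp\left[\left(\frac{\lambda}{\ln U} - \frac{\mu}{\ln V}\right) \frac{(-\ln W)}{\lambda}\right] < t\right\}$$

$$= \Pr\left\{\frac{c \ln V}{\ln U} \exp\left[\left(\frac{1}{\ln U} - \frac{1}{c \ln V}\right)(-\ln W)\right] < t\right\}$$

where $U$, $V$, and $W$ are independent and identically uniformly distributed on (0,1) by Lemma 2.

Similarly for $L(t;\tilde\lambda,\tilde\mu)$,

$$\Pr\{\text{assign } Z \sim G \mid Z \sim F\} = \Pr\left\{\frac{Y+Z}{X+Z} \exp\left[\left(\frac{2}{Y+Z} - \frac{2}{X+Z}\right) Z\right] < t \,\Big|\, Z \sim F\right\}$$

$$= \Pr\left\{\frac{c \ln V + \ln W}{\ln U + \ln W} \exp\left[2\left(\frac{1}{\ln U + \ln W} - \frac{1}{c \ln V + \ln W}\right)(-\ln W)\right] < t\right\}.$$

The result for arbitrary $m,n$ follows by induction.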
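Note that

$$P_2 = \Pr\{\text{assign } Z \sim F \mid Z \sim G\} = \Pr\left\{\frac{\hat\lambda}{\hat\mu} \exp[(\hat\mu - \hat\lambda) Z] > t \,\Big|\, Z \sim G\right\} = \Pr\left\{\frac{\hat\mu}{\hat\lambda} \exp[(\hat\lambda - \hat\mu) Z] < 1/t \,\Big|\, Z \sim G\right\}$$

is equal to $P_1$ for the situation in which $\lambda$ and $\mu$ have been interchanged and $t$ replaced by $1/t$, i.e., $P_2$ for $L(t;\hat\lambda,\hat\mu)$ equals $P_1$ for $L(1/t;\hat\mu,\hat\lambda)$. A similar statement is valid for $L(t;\tilde\lambda,\tilde\mu)$.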

In seeking the error probabilities of the procedures $L(t;\hat\lambda,\hat\mu)$ and $L(t;\tilde\lambda,\tilde\mu)$ one must calculate

$$P_1 = P\left\{\frac{\hat\lambda}{\hat\mu}\, e^{(\hat\mu - \hat\lambda) Z} < t \,\Big|\, Z \sim F\right\} = P\left\{\ln\frac{1}{\hat\mu} - \ln\frac{1}{\hat\lambda} + (\hat\mu - \hat\lambda) Z < \ln t \,\Big|\, Z \sim F\right\}$$

(and the corresponding expression in $\tilde\lambda$, $\tilde\mu$), where in procedure $L(t;\hat\lambda,\hat\mu)$, $\dfrac{m}{\hat\lambda} \sim \text{Gamma}(\lambda, m)$, $\dfrac{n}{\hat\mu} \sim \text{Gamma}(\mu, n)$, $Z \sim \text{Gamma}(\lambda, 1)$, and in procedure $L(t;\tilde\lambda,\tilde\mu)$, $\dfrac{m+1}{\tilde\lambda} = U + Z$ where $U \sim \text{Gamma}(\lambda, m)$, $Z \sim \text{Gamma}(\lambda, 1)$ so that $U + Z \sim \text{Gamma}(\lambda, m+1)$, and $\dfrac{n+1}{\tilde\mu} = V + Z$ where $V \sim \text{Gamma}(\mu, n)$. In the $L(t;\hat\lambda,\hat\mu)$ procedure, if $t$ is a constant, it appears that $P_1$ should be calculable by a straightforward triple numerical integration. In the $L(t;\tilde\lambda,\tilde\mu)$ procedure the boundary of the region of integration for $Z$ involves the solution of a transcendental equation, but this too may be done numerically and $P_1$ calculated for fixed $t$. However, when $t$ is permitted to be a function of the observations, the integral becomes intractable. For this reason, and because the investigator wished to compare the performance of the Likelihood Ratio procedures to that of the Loftsgaarden-Quesenbury procedure, which is almost impossible to assess analytically, the decision was made to conduct this investigation through a Monte Carlo study. The following procedures were investigated:

1)  $L(t;\hat\lambda,\hat\mu)$   t = 1

2)  $L(t;\tilde\lambda,\tilde\mu)$   t = 1

3)  $L(t;\hat\lambda,\hat\mu)$   "minimax"

4)  $L(t;\tilde\lambda,\tilde\mu)$   "minimax"

5)  Rule of nearest neighbor

6)  Loftsgaarden-Quesenbury procedure $L^*(t;\tilde f,\tilde g)$   t = 1
The computer program, run on an IBM 360 computer, generated, by means of the probability integral transform, the random sample of X's and Y's, and the observation Z to be classified. The various classification procedures were performed and correct or incorrect classification of z was recorded.
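A study of this kind can be sketched compactly for the first procedure (estimates from the samples only, t = 1). This is not the original IBM 360 program: the function names `draw` and `estimate_risk`, the seed, and the choice of a prior of (.5, .5) for the true population of Z are assumptions made for illustration.

```python
import math
import random

def draw(rate, size, rng):
    # Probability integral transform: -ln(U)/rate is exponential with the given rate.
    return [-math.log(1.0 - rng.random()) / rate for _ in range(size)]

def estimate_risk(c, n, reps=10_000, seed=1):
    """Monte Carlo estimate of the expected risk of the plug-in rule with t = 1."""
    rng = random.Random(seed)
    lam, mu = c, 1.0                    # only the ratio c = lam/mu matters (Theorem 5)
    errors = 0
    for _ in range(reps):
        lam_hat = n / sum(draw(lam, n, rng))
        mu_hat = n / sum(draw(mu, n, rng))
        from_f = rng.random() < 0.5     # true population of Z, prior (.5, .5)
        z = draw(lam if from_f else mu, 1, rng)[0]
        # Log of the plug-in likelihood ratio; assign to F when it is >= 0.
        says_f = math.log(lam_hat / mu_hat) + (mu_hat - lam_hat) * z >= 0.0
        errors += says_f != from_f
    return errors / reps

print(estimate_risk(2.0, 10))  # a value somewhat above the limiting risk of .375
```

The other procedures could be added to the same loop so that all of them are scored on identical samples.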

The Monte Carlo procedure may be viewed as an attempt to estimate the parameter $p$ of a Bernoulli random variable; i.e., the probability with which a randomly selected observation will be misclassified. As such, the distribution of the estimates which have been obtained may be estimated. Since $p$ is reasonably close to one-half in all cases, and since 10,000 replications of the Monte Carlo procedure were summed, it may be assumed that the estimate

$$\hat p = \frac{1}{10{,}000} \sum_{i=1}^{10,000} B_i,$$

where $B_i = 0$ with probability $(1-p)$, 1 with probability $p$, has approximately the Normal distribution with mean $p$ and variance $\dfrac{p(1-p)}{10{,}000} \le .25 \times 10^{-4}$. Hence a 95% confidence interval may be formed for the value of $p$ in each case:

$$.95 = \Pr\{|\hat p - p| < 1.96\sigma\} \le \Pr\{|\hat p - p| < .0098\}.$$
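The half-width .0098 follows from the worst case p = 1/2, as this short check shows. The function name `half_width` is illustrative.

```python
import math

def half_width(reps, p=0.5, z=1.96):
    """Normal-approximation 95% half-width for a proportion estimated from reps trials."""
    return z * math.sqrt(p * (1.0 - p) / reps)

print(round(half_width(10_000), 4))  # 0.0098
```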
For comparison with these results, the analytically computed misclassification probabilities of the rule of nearest neighbor and linear discriminant function obtained by Hager are reproduced.
Table 1
Misclassification Error Probabilities for Procedures

Description of Procedures:

1.  $L(t;\hat\lambda,\hat\mu)$      t = 1

2.  $L(t;\tilde\lambda,\tilde\mu)$      t = 1

3.  $L(t;\hat\lambda,\hat\mu)$      "minimax"

4.  $L(t;\tilde\lambda,\tilde\mu)$      "minimax"

5.  $L^*(t;\hat f,\hat g)$    t = 1   "Rule of Nearest Neighbor"

6.  $L^*(t;\tilde f,\tilde g)$    t = 1   Loftsgaarden and Quesenbury Procedure, $j(n) = \ell(n) = n^{1/2}$

7.  "Rule of Nearest Neighbor" - from Hager (4)

8.  Linear Discriminant Function - from Hager (4)

C = $\lambda/\mu$

N = size of sample from each population upon which classification procedure is based

Table 2

EXPECTED RISK FOR ALL PROCEDURES WITH PRIOR (.5, .5)

                                Procedure
     C     N      1     2     3     4     5     6     7     8

   1.5     1    .488  .491  .495  .495  .487  .487  .488  .488
           2    .482  .483  .481  .483  .488  .488  .486  .480
           5    .470  .471  .476  .475  .487  .480  .484  .467
          10    .453  .454  .455  .454  .483  .478  .483  .455
          20    .444  .443  .448  .448  .480  .470  ****  .442
           ∞    .426  .426  .430  .430  ****  .426  .481  .426

   2.0     1    .465  .467  .462  .466  .468  .468  .467  .467
           2    .448  .450  .451  .449  .460  .460  .461  .447
           5    .410  .410  .417  .414  .456  .444  .456  .416
          10    .393  .392  .401  .400  .453  .427  .453  .394
          20    .377  .377  .381  .381  .453  .420  .452  .381
           ∞    .375  .375  .382  .382  ****  .375  .451  .375

   3.0     1    .419  .425  .427  .431  .425  .425  .424  .424
           2    .378  .380  .386  .386  .414  .414  .411  .385
           5    .333  .333  .337  .336  .401  .378  .401  .338
          10    .315  .316  .326  .326  .391  .361  .397  .319
          20    .311  .311  .324  .323  .398  .350  .395  .313
           ∞    .308  .308  .318  .318  ****  .308  .395  .309

   5.0     1    .352  .358  .355  .366  .365  .365  .361  .361
           2    .289  .294  .304  .307  .340  .340  .338  .307
           5    .248  .250  .262  .262  .320  .296  .322  .264
          10    .235  .237  .249  .249  .314  .269  .319  .255
          20    .234  .235  .245  .245  .316  .258  .318  .253
           ∞    .233  .233  .245  .245  ****  .233  .319  .250

  10.0     1    .258  .270  .259  .285  .285  .285  .286  .286
           2    .196  .206  .211  .224  .249  .249  .248  .235
           5    .166  .169  .180  .184  .227  .205  .226  .215
          10    .158  .160  .173  .175  .222  .181  .222  .213
          20    .155  .155  .167  .168  .222  .168  .221  .213
           ∞    .152  .152  .165  .165  ****  .152  .222  .214

  20.0     1    .194  .208  .184  .228  .231  .231  .233  .233
           2    .133  .141  .140  .161  .184  .184  .184  .201
           5    .105  .107  .114  .124  .153  .149  .153  .198
          10    .098  .101  .111  .115  .145  .123  .146  .201
          20    .092  .093  .104  .106  .142  .105  .145  .202
           ∞    .094  .094  .106  .106  ****  .094  .145  .204
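
The limiting (N = ∞) entries for Procedures 1, 2 and 6 can be cross-checked against the risk of the likelihood ratio rule with known parameters. The sketch below assumes that C is the ratio of the two exponential rate parameters and that the rule assigns Z to the population with the larger density (t = 1). The function name and the scaling of the second rate to 1 are illustrative choices, not notation from this thesis.

```python
import math

def bayes_risk_equal_priors(c):
    """Risk of the known-parameter likelihood ratio rule (t = 1), prior (.5, .5).

    Assumed setup: population F has rate c, population G has rate 1, c > 1.
    """
    if c <= 1:
        raise ValueError("c must exceed 1")
    z_star = math.log(c) / (c - 1.0)   # point where the two densities cross
    p1 = math.exp(-c * z_star)         # F observation falls above the cutoff
    p2 = 1.0 - math.exp(-z_star)       # G observation falls below the cutoff
    return 0.5 * (p1 + p2)

for c in (1.5, 2.0, 3.0, 5.0, 10.0, 20.0):
    print(c, round(bayes_risk_equal_priors(c), 3))
```

Under these assumptions the computed values agree with the tabulated limiting entries for Procedures 1, 2 and 6 to within rounding.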

IV.   SUMMARY AND CONCLUSIONS

A number of interesting facts are evident from inspection
of the results of the investigations conducted in this
thesis. Perhaps the most startling is that for values of
c not greater than 5 and all sample sizes up to and including
20 the expected risk (with prior (½, ½)) of the linear
discriminant function is uniformly smaller than that for
either of the non-parametric procedures (see Figure 1). The
linear discriminant is equivalent to procedure L(t;A,y) with
t chosen in a somewhat bizarre fashion, since it divides the
positive line into two intervals which are acceptance regions
for {Z ~ F} and {Z ~ G}. Hence the linear discriminant
minimizes P_2 for the P_1 which it achieves, and though
the division of the total error probability is very uneven,
the average is small enough to better the non-parametric
procedures.

Also interesting is the fact that the expected risks of
procedures L(t;A,y) and L(t;X,y) are almost identical even
for very small sample sizes. In general P_1 is larger for
L(t;A,y) than for L(t;X,y), but P_2 for the latter procedure
is smaller so as to keep the average almost constant. The
"minimax" procedure appears to achieve the desired equalization
of P_1 and P_2 fairly well for moderate sample sizes
(n > 10), but fails quite badly for n = 1 or 2. It appears
that, for n > 5, the average risk is not increased appreciably
by using the "minimax" procedure.
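
The equalization that the "minimax" procedure aims for can be illustrated in the limiting case of known parameters. The sketch below finds, by bisection, the cutoff at which the two error probabilities coincide. It assumes rates c for F and 1 for G and a rule that assigns small observations to F; these choices and the function name are assumptions for the example, not the definition of Procedures 3 and 4.

```python
import math

def minimax_error(c, tol=1e-12):
    """Common error probability when P1 = P2, for rates c (F) and 1 (G), c > 1."""
    def gap(z):
        # P1(z) - P2(z): P1 falls and P2 rises as the cutoff z increases.
        return math.exp(-c * z) - (1.0 - math.exp(-z))

    lo, hi = 0.0, 50.0   # gap(lo) = 1 > 0 and gap(hi) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return math.exp(-c * lo)

for c in (1.5, 2.0, 3.0, 5.0, 10.0, 20.0):
    print(c, round(minimax_error(c), 3))
```

Under these assumptions the results agree with the limiting entries of Table 2 for Procedures 3 and 4 to about three decimal places.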

The negligible improvement in the performance of the
likelihood ratio discriminant procedures for sample sizes in
excess of 10 and the extremely slow approach to optimality
of the Loftsgaarden and Quesenberry procedure are also interesting.
An example of this for c = 10 is shown in Figure 2.

The considerable disparity of the values of P_1 and P_2
for many of the procedures considered in this thesis raises
an interesting philosophical point which an investigator
should settle for himself before selecting one of these methods
for use. If, for example, one is willing to accept the
possibility that a large percentage of the members of one
population will be misclassified, although the average number
of misclassifications is apt to be moderate, then the
use of the linear discriminant function may be preferable to
the use of the non-parametric procedures (unless c is very
large). If, however, one is reassured by the fact that the
rule of nearest neighbor makes errors no more than half the
time (asymptotically) no matter what the situation, one may
have a predilection for that procedure. The superiority in
terms of expected risk of the linear discriminant function
over the non-parametric procedures for small c is shown in
Figure 1 where, for example, for n = 2, c = 5 the linear
discriminant has expected risk about .03 lower than the rule of
nearest neighbor; for n = ∞, c = 5 the difference is almost
.06. In fact, the performance of the linear discriminant
where c ≤ 3 is almost identical with that of the best procedure
in this range, L(1;A,y). However, reference to Figure
3 indicates that in the same cases, P_2 for the linear
discriminant is much greater than that for the rule of nearest
neighbor. Also apparent in Figure 3 is the non-monotonicity of P_2
for several procedures. Table 1 gives both P_1 and P_2 for all
cases considered in this thesis so that expected risks for
mixing probabilities other than (½, ½) may be easily calculated.
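
The calculation for other mixing probabilities is a weighted average of the two error probabilities. The short Python sketch below assumes that P_1 is the probability of misclassifying an observation from the first population and P_2 that for the second; the numerical values are illustrative only and are not taken from Table 1.

```python
def expected_risk(p1, p2, prior_f=0.5):
    """Weight the two error probabilities by the mixing probabilities."""
    if not 0.0 <= prior_f <= 1.0:
        raise ValueError("prior_f must lie in [0, 1]")
    return prior_f * p1 + (1.0 - prior_f) * p2

# Illustrative error probabilities (hypothetical, not from Table 1).
p1, p2 = 0.25, 0.50
print(expected_risk(p1, p2))         # 0.375
print(expected_risk(p1, p2, 0.75))   # 0.3125
```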

The following recommendations seem appropriate based on
this study. If one can be reasonably certain that the populations
are negative exponential, and there is no reason to
suppose that the unknown observation is more likely to be
from one of the populations than from the other, the minimax
version of L(t;A,y) (Procedure 3) would be a good choice if
n > 5. For smaller samples the same procedure with t = 1
(Procedure 1) seems better. If observations from one of the
populations are appreciably more likely than those from the
other, a procedure taking this fact into account by taking
more observations from the more likely population and/or
estimating the probability of occurrence of the populations
(if these probabilities are not known) should be considered.
A selection of the parameter t in the chosen procedure in
order to minimize the expected risk with respect to the
estimated (or known) population probabilities could then be
made. Because the probability of classification error does
not decrease appreciably as n increases from ten to infinity
for the likelihood ratio procedures, it appears that the use
of samples larger than ten in Procedures 1 - 4 is unwarranted
unless the cost of sampling is very small. If one cannot be
certain that the populations are negative exponential, a
choice between the linear discriminant and a non-parametric
procedure may be appropriate. The attitude of the experimenter
toward the importance of P_1 and P_2 individually should
influence his decision in this case.
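
The selection of t for unequal population probabilities, suggested above, can be illustrated in the known-parameter case. The standard result is that the rule "assign to F when f(z)/g(z) is at least t" minimizes expected risk at t equal to the ratio of the prior of G to the prior of F. The sketch assumes rates c for F and 1 for G; the function names, the example values, and the orientation of the ratio are assumptions and may differ from the definition of L(t; ·) given earlier in this thesis.

```python
import math

def cutoff(c, t=1.0):
    """z* such that c*exp(-c*z)/exp(-z) >= t exactly when z <= z* (c > 1)."""
    return max(0.0, math.log(c / t) / (c - 1.0))

def risk(c, prior_f, t):
    """Expected risk of the rule 'assign to F when the density ratio >= t'."""
    z = cutoff(c, t)
    p1 = math.exp(-c * z)       # F observation assigned to G
    p2 = 1.0 - math.exp(-z)     # G observation assigned to F
    return prior_f * p1 + (1.0 - prior_f) * p2

c, prior_f = 5.0, 0.8                  # hypothetical example values
t_bayes = (1.0 - prior_f) / prior_f    # risk-minimizing threshold
print(risk(c, prior_f, 1.0), risk(c, prior_f, t_bayes))
```

In this example the threshold matched to the prior gives a smaller expected risk than t = 1.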